An Efficient Approach to Optimize the Performance of Massive Small Files in Hadoop MapReduce Framework

Authors

  • Guru Prasad
  • Swathi Prabhu
Abstract

Hadoop, the most popular open source distributed computing framework, was designed by Doug Cutting and his team; it can involve thousands of nodes to process and analyze huge amounts of data, known as Big Data. The core components of Hadoop are HDFS (Hadoop Distributed File System) and MapReduce. The framework is popular and powerful for storing, managing, and processing Big Data applications. Its drawback is stability and performance problems when storing, managing, and processing the data of small-file applications. The existing approaches to the small files problem are Hadoop archives and SequenceFile. However, these approaches do not deliver optimized performance for small files on Hadoop. To improve the performance of storing, managing, and processing small files on Hadoop, we propose an approach for the Hadoop MapReduce framework that handles small-file applications. Experimental results show that the proposed framework improves the performance of Hadoop in handling massive numbers of small files compared with the existing approaches.

Keywords: Hadoop, Hadoop Distributed File System (HDFS), MapReduce, Hadoop Archives, Sequence File, Small Files.
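
The two existing approaches named in the abstract, Hadoop archives and SequenceFile, both pack many small files into a larger container so that the cluster tracks fewer objects. The sketch below illustrates only that generic packing idea in plain Python. It does not use any Hadoop API and it is not the approach proposed by the authors; the function names `pack_small_files` and `read_packed_file` are hypothetical and chosen for illustration.

```python
import io
from typing import Dict, Tuple

# Index type: file name -> (byte offset in the container, length in bytes).
Index = Dict[str, Tuple[int, int]]


def pack_small_files(files: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate many small payloads into one container and build an index.

    Hypothetical helper: a stand-in for the general "merge small files"
    idea, not the HAR or SequenceFile on-disk format.
    """
    container = io.BytesIO()
    index: Index = {}
    for name in sorted(files):  # sorted for a deterministic layout
        data = files[name]
        index[name] = (container.tell(), len(data))
        container.write(data)
    return container.getvalue(), index


def read_packed_file(container: bytes, index: Index, name: str) -> bytes:
    """Return the original bytes of one small file from the container."""
    offset, length = index[name]
    return container[offset:offset + length]
```

With this layout, a storage layer has to track one container plus a compact index instead of one entry per small file, at the cost of an extra index lookup on every read.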

Similar resources

A New HDFS Structure Model to Evaluate the Performance of Word Count Application on Different File Size

MapReduce is a powerful distributed processing model for large datasets. Hadoop is an open source framework and an implementation of MapReduce. The Hadoop distributed file system (HDFS) has become very popular for building large-scale, high-performance distributed data processing systems. HDFS is designed mainly to handle large files, so the processing of massive small files is a challenge in native ...

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale, data-intensive applications. Current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy for MapReduce in a homogeneous Hadoop cluster assume that each node in the cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...

Live Website Traffic Analysis Integrated with Improved Performance for Small Files using Hadoop

Hadoop, an open source Java framework, deals with big data. It has HDFS (Hadoop distributed file system) and MapReduce. HDFS is designed to handle large files across clusters and suffers a performance penalty when dealing with a large number of small files. These small files pose a heavy burden on the NameNode of HDFS and increase the execution time of MapReduce. Secondly, ...

Improving the Performance of Processing for Small Files in Hadoop: A Case Study of Weather Data Analytics

Hadoop is an open source Apache project that supports a master-slave architecture, which involves one master node and thousands of slave nodes. The master node acts as the name node, which stores all the metadata of files, and the slave nodes act as the data nodes, which store all the application data. Hadoop is designed to process large data sets (petabytes). It becomes a bottleneck, when handling mas...

GOM-Hadoop: A distributed framework for efficient analytics on ordered datasets

One of the most common datasets exploited by many corporations to conduct business intelligence analysis is event log files. Oftentimes, the records in event log files are temporally ordered and need to be grouped by a certain key, with the temporal ordering preserved, to facilitate further analysis. One such example is to group temporally ordered events by user ID in order to analyze user behavio...

Publication date: 2017